Abstract

We consider the prediction error of linear regression with $\ell_1$ regularization when the number of covariates $p$
is large relative to the sample size $n$. When the model is $k$-sparse and well-specified, and restricted isometry or
similar conditions hold, the excess squared-error in prediction can be bounded on the order of $\sigma^2 \frac{k \log(p)}{n}$, where
$\sigma^2$ is the noise variance. Although these conditions are close to necessary for accurate recovery of the true
coefficient vector, it is possible to guarantee good predictive accuracy under much milder conditions, avoiding
the restricted isometry condition, but only ensuring an excess error bound of order
$\frac{k \log(p)}{n} + \sigma \sqrt{\frac{k \log(p)}{n}}$. Here
we show that this is indeed the best bound possible (up to logarithmic factors) without introducing stronger
assumptions similar to restricted isometry.



1 Introduction 

We consider a random design linear regression problem with p covariates: 

$$y = x^T \beta^* + z$$

where $x \in \mathbb{R}^p$ are random covariates with covariance matrix $\Sigma$, $z$ is random noise with $\mathbb{E}[z^2] = \sigma^2$, and
$\beta^* \in \mathbb{R}^p$ are the regression coefficients. For simplicity we take the response to be normalized, $\mathbb{E}[y^2] = 1$
(otherwise all results scale accordingly).

We consider the problem of minimizing the prediction error 

$$\mathbb{E}\left[(y - x^T \beta)^2\right]$$

based on an i.i.d. sample $(x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})$ using $\ell_1$-regularized regression:

$$\hat\beta^B = \arg\min_{\|\beta\|_1 \le B} \sum_i \left(y^{(i)} - x^{(i)T} \beta\right)^2 .$$

Note that up to some unknown and data-dependent correspondence between $B$ and $\lambda$, this is the same as

$$\hat\beta_\lambda = \arg\min_{\beta} \sum_i \left(y^{(i)} - x^{(i)T} \beta\right)^2 + \lambda \|\beta\|_1 ,$$

also known as Lasso regression [Tibshirani, 1996].
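As a rough illustration of the penalized (Lasso) form, the following minimal sketch minimizes $\sum_i (y^{(i)} - x^{(i)T}\beta)^2 + \lambda\|\beta\|_1$ by cyclic coordinate descent with soft-thresholding. Coordinate descent is a standard solver chosen here for brevity, not a method taken from this paper, and the function names `soft_threshold` and `lasso_coordinate_descent` are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Approximately minimize sum_i (y_i - x_i^T beta)^2 + lam * ||beta||_1.

    Illustrative solver only (an assumption of this sketch, not the paper's method).
    X is a list of n rows, each a list of p floats; y is a list of n floats.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # residual = y - X beta, and beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column never enters the model
            # Inner product of column j with the residual that excludes coordinate j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n))
            # Exact one-dimensional minimizer of the penalized objective in beta_j.
            new_value = soft_threshold(rho, lam / 2.0) / col_sq[j]
            delta = new_value - beta[j]
            for i in range(n):
                residual[i] -= X[i][j] * delta
            beta[j] = new_value
    return beta


# Orthonormal design: the solution is coordinate-wise soft-thresholding of y.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.2]
beta_hat = lasso_coordinate_descent(X, y, lam=1.0)
print(beta_hat)
```

With an orthonormal design each coordinate decouples, so the large response is shrunk by $\lambda/2$ and the small one is set exactly to zero, which is the sparsity-inducing behaviour the $\ell_1$ penalty is used for.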



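Suppose that the covariates are 1-bounded, and that $\max_i |y^{(i)}| \le O(\log(np))$ (for instance, this is true with
high probability in the Gaussian setting). Then, by [Srebro et al., 2010], with high probability over the sample,
for any fixed $\beta^*$ with $\|\beta^*\|_1 \le B$, excess squared-error under $\ell_1$-regularized regression is bounded as

$$\mathbb{E}\left[(y - x^T \hat\beta^B)^2\right] - \mathbb{E}\left[(y - x^T \beta^*)^2\right] \le O\left( \frac{(1+B)^2 \log(p)}{n / \log^3(n)} + \sqrt{\mathbb{E}\left[(y - x^T \beta^*)^2\right] \cdot \frac{(1+B)^2 \log(p)}{n / \log^3(n)}} \right) \qquad (1)$$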



This result does not require any conditions on the correlation between the covariates, or on the nature of the
"noise" $y - x^T \beta^*$, aside from the mild bound on $\max_i |y^{(i)}|$. In particular, this noise is not required to be
independent from $x$. We believe also that this result would hold for subgaussian $x$'s (rather than our current
stronger assumption that the $x$'s are 1-bounded).



We can apply this result to the sparse regression setting, with some mild additional assumptions. Suppose that
we are interested in comparing to a sparse predictor on an unknown support $J^* \subset [p]$, with $|J^*| \le k$. We now
place a lower-bound eigenvalue assumption on this support $J^*$ only:

$$\lambda_{\min}\left(\mathbb{E}\left[x_{J^*} x_{J^*}^T\right]\right) \ge \lambda_1 > 0 \qquad (2)$$

where $x_{J^*} = (x_j : j \in J^*)$ is the random vector consisting of those covariates $x_j$ for which $\beta^*_j$ is nonzero. This
assumption is strictly weaker than the restricted isometry property (RIP) conditions in the compressed sensing
literature, which require an upper-bound assumption as well, and require the eigenvalue bounds to hold for all
sets $J \subset [p]$ of bounded size, in addition to the true support $J^*$.

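We fix the scale of the problem by assuming $\mathbb{E}[y^2] = 1$. Now consider a predictor $\beta^*$ with support in $J^*$,
which is better than the zero predictor, that is, $\mathbb{E}[(y - x^T \beta^*)^2] \le \mathbb{E}[(y - x^T 0)^2] = \mathbb{E}[y^2] = 1$. We now
show that $\|\beta^*\|_1 \le O\left(\sqrt{k / \lambda_1}\right)$.

We first bound $\|\beta^*\|_2$, by observing that

$$\|\beta^*\|_2^2 \cdot \lambda_1 \le (\beta^*)^T \mathbb{E}[x x^T] \beta^* = \mathbb{E}\left[(x^T \beta^*)^2\right] = \mathbb{E}\left[(y - x^T \beta^*)^2 - 2 y (y - x^T \beta^*) + y^2\right] \le 2\,\mathbb{E}\left[(y - x^T \beta^*)^2\right] + 2\,\mathbb{E}[y^2] \le 4 .$$

We then have

$$\|\beta^*\|_1 \le \sqrt{k}\, \|\beta^*\|_2 \le \sqrt{k \cdot 4 \lambda_1^{-1}} = O\left(\sqrt{k \lambda_1^{-1}}\right) .$$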



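Therefore, with high probability,

$$\mathbb{E}\left[(y - x^T \hat\beta^B)^2\right] - \mathbb{E}\left[(y - x^T \beta^*)^2\right] \le O\left( \frac{k \log(p)}{\lambda_1 n / \log^3(n)} + \sqrt{\frac{k \log(p)}{\lambda_1 n / \log^3(n)} \cdot \mathbb{E}\left[(y - x^T \beta^*)^2\right]} \right) \qquad (3)$$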



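under the assumption that the $x$'s are 1-bounded and $\max_i |y^{(i)}| \le O(\log(np))$. Therefore, to guarantee a bound of
$\varepsilon$ on the excess prediction error, the required sample complexity is

$$n = \tilde{O}\left( \frac{k \log(p)}{\lambda_1 \varepsilon} \cdot \frac{\sigma^2 + \varepsilon}{\varepsilon} \right) \qquad (4)$$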



where $\sigma^2 = \mathbb{E}\left[(y - x^T \beta^*)^2\right]$ is the magnitude of the noise. This sample complexity follows an "optimistic
rate": in the noisy setting, if we would like to ensure a bound $\varepsilon$ on excess error which is small relative to $\sigma^2$,
then the required sample complexity is $n = \tilde{O}(\varepsilon^{-2})$, but on the other hand, in the noiseless setting
(i.e. when $y = x^T \beta^*$), or if the bound on excess error $\varepsilon$ is not much smaller than $\sigma^2$, then we require only
$n = \tilde{O}(\varepsilon^{-1})$. We emphasize that this result does not assume that the linear model is a true model or require
independent noise.
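To make the two regimes of the optimistic rate concrete, here is a small numerical sketch. It assumes the bound has the form $a + \sqrt{\sigma^2 a}$ with $a = k\log(p) / (\lambda_1 n / \log^3(n))$, and it sets the unspecified $O(\cdot)$ constant to 1; the function name and the parameter values are hypothetical and chosen only for illustration.

```python
import math


def optimistic_rate_bound(k, p, n, lambda_1, noise_var):
    """Evaluate a + sqrt(noise_var * a), with a = k log(p) / (lambda_1 * n / log^3(n)).

    Assumption of this sketch: the hidden O(.) constant is taken to be 1.
    """
    effective_n = lambda_1 * n / math.log(n) ** 3
    a = k * math.log(p) / effective_n
    return a + math.sqrt(noise_var * a)


k, p, lambda_1 = 5, 1000, 0.5
for n in (10**4, 10**5, 10**6, 10**7):
    noiseless = optimistic_rate_bound(k, p, n, lambda_1, noise_var=0.0)
    noisy = optimistic_rate_bound(k, p, n, lambda_1, noise_var=1.0)
    print(n, noiseless, noisy)
```

In the noiseless column the value decays roughly like $1/n$, while in the noisy column the square-root term dominates once $a$ is small, so the decay is closer to $1/\sqrt{n}$.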



In contrast, results on sparse vector recovery from the compressed sensing framework [Candes and ...,
Bickel et al., 2009, Koltchinskii, 2009, Cai et al., 2009] provide stronger guarantees in a similar setting, using
either $\ell_1$-regularized regression or the Dantzig selector, given by



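$$\hat\beta^{DS}_\lambda = \arg\min_{\beta} \max_j \left| \sum_i x_j^{(i)} \left( y^{(i)} - x^{(i)T} \beta \right) \right| + \lambda \|\beta\|_1 .$$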



These stronger results require several additional specialized assumptions, including the requirement that the
noise must be independent from the signal. Existing results are stated either in the deterministic or random
covariates setting, but can in general be translated to a random Gaussian setting. We restrict our attention
to $\ell_1$-regularized regression when the covariates are i.i.d. multivariate Gaussian with zero mean:
$x^{(i)} \overset{iid}{\sim} N(0, \Sigma)$. We now summarize this setting (with some simplifications), and compare it to the optimistic-rate
results discussed above.



• Well-specified model with independent subgaussian noise: Response is given by $y^{(i)} = x^{(i)T} \beta^* + \sigma z^{(i)}$
for a true predictor $\beta^*$ satisfying $\|\beta^*\|_1 \le B$, and $z^{(i)}$ is a subgaussian or subexponential noise term
with unit variance, and is independent from $x^{(i)}$.

The main additional requirement here is that noise $z$ is independent of $x$. This in particular implies that $\beta^*$
is the optimal regressor. Note that in order to obtain the optimistic-rate guarantee (3), no such assumption
is necessary, and $\beta^*$ can be a non-optimal regressor chosen for its sparsity or eigenvalue properties.

• Sparsity: $\beta^*$ is $k$-sparse, meaning that it has (at most) $k$ non-zero entries.

To obtain the optimistic-rate guarantee as stated originally in (1), we can relax this requirement and only
assume that $\beta^*$ has low $\ell_1$-norm.

• Restricted eigenvalues: There exists a $\kappa = \kappa(k, 3) > 0$, such that for any $J \subset [p]$ with $|J| \le k$, for any
nonzero $\beta \in \mathbb{R}^p$ with $\|\beta_{\bar J}\|_1 \le 3 \|\beta_J\|_1$,

$$\beta^T \Sigma \beta \ge \kappa \, \|\beta_J\|_2^2 \qquad (5)$$

This restricted eigenvalue condition is implied by a stronger condition:

Restricted isometry: Suppose that $\delta_{2k} + 3\theta_{k,2k} < 1$, where $v^T \Sigma v \in (1 \pm \delta_{2k}) \|v\|_2^2$ for all $2k$-sparse
vectors $v$, and $|v^T \Sigma w| \le \theta_{k,2k} \|v\|_2 \|w\|_2$ for all $k$-sparse $v$ and $2k$-sparse $w$ with disjoint supports. Then
$\kappa = \sqrt{1 - \delta_{2k}} \left(1 - \frac{3\theta_{k,2k}}{1 - \delta_{2k}}\right)$ satisfies the restricted eigenvalue condition above.

To obtain the optimistic-rate guarantee (3) under the sparsity assumption, we required an eigenvalue
condition (2) on $\Sigma_{\mathrm{support}(\beta^*)}$ only, which is strictly weaker than the restricted eigenvalue and restricted
isometry assumptions.
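As a small worked example of the restricted isometry condition, the sketch below evaluates $\kappa = \sqrt{1-\delta_{2k}}\,(1 - 3\theta_{k,2k}/(1-\delta_{2k}))$ and checks the requirement $\delta_{2k} + 3\theta_{k,2k} < 1$. The closed form used here is an assumption of this sketch (it is the form the condition is read as taking), and the function name and input values are hypothetical.

```python
import math


def restricted_eigenvalue_constant(delta_2k, theta_k_2k):
    """Return sqrt(1 - delta) * (1 - 3 * theta / (1 - delta)) when delta + 3 * theta < 1.

    The formula is assumed for illustration; inputs are made-up isometry constants.
    """
    if not 0.0 <= delta_2k < 1.0 or theta_k_2k < 0.0:
        raise ValueError("need 0 <= delta_2k < 1 and theta_k_2k >= 0")
    if delta_2k + 3.0 * theta_k_2k >= 1.0:
        raise ValueError("restricted isometry condition delta_2k + 3*theta_k_2k < 1 fails")
    return math.sqrt(1.0 - delta_2k) * (1.0 - 3.0 * theta_k_2k / (1.0 - delta_2k))


kappa = restricted_eigenvalue_constant(delta_2k=0.1, theta_k_2k=0.1)
print(kappa)
```

The condition $\delta_{2k} + 3\theta_{k,2k} < 1$ is exactly what keeps the second factor positive, so the function rejects inputs for which the resulting constant would be zero or negative.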



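Under these assumptions, with $\kappa$ defined as in (5), the following guarantees hold with high probability, by
Theorem 7.2 of [Bickel et al., 2009]:

Sparse and accurate estimation of $\beta^*$:

$$\left\|\hat\beta^B - \beta^*\right\|_1 = O\left( \frac{\sigma k}{\kappa^2} \sqrt{\frac{\log(p)}{n}} \right), \quad \text{and } \|\hat\beta^B\|_0 = O(k)$$

Bounded excess prediction error:

$$\mathbb{E}\left[(y - x^T \hat\beta^B)^2\right] = \mathbb{E}\left[(y - x^T \beta^*)^2\right] + O\left( \frac{\sigma^2 k \log(p)}{\kappa^2 n} \right) \qquad (6)$$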
This corresponds to a sample complexity of

$$n = O\left( \frac{\sigma^2 k \log(p)}{\kappa^2 \varepsilon} \right)$$

to ensure an excess error bound of $\varepsilon$. It is crucial to note that the error bound (and the sample complexity)
scales with the magnitude of the noise, $\sigma^2$, rather than with the (unit) magnitude of the signal. In particular, in
a noiseless setting, the results above guarantee a zero-error reconstruction of $\beta^*$, in contrast to the "optimistic
rate" result (3) where no such guarantee is given. Furthermore, in the noisy setting, the compressed sensing
guarantees give a "fast rate" result, since the sample complexity scales with $\frac{1}{\varepsilon}$ rather than with $\frac{1}{\varepsilon^2}$.
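The difference in scaling can be checked numerically. The sketch below compares a fast-rate sample size of the form $\sigma^2 k\log(p)/(\kappa^2\varepsilon)$ with an optimistic-rate sample size of the form $\frac{k\log(p)}{\lambda_1\varepsilon}\cdot\frac{\sigma^2+\varepsilon}{\varepsilon}$. Both expressions drop all constants and logarithmic factors in $n$, which is an assumption of this sketch, and the function names and parameter values are hypothetical.

```python
import math


def fast_rate_sample_size(k, p, kappa, noise_var, eps):
    """Fast-rate scaling: noise_var * k * log(p) / (kappa^2 * eps), constants dropped."""
    return noise_var * k * math.log(p) / (kappa ** 2 * eps)


def optimistic_rate_sample_size(k, p, lambda_1, noise_var, eps):
    """Optimistic-rate scaling: k log(p) / (lambda_1 eps) * (noise_var + eps) / eps.

    Constants and log(n) factors are dropped (an assumption of this sketch).
    """
    return k * math.log(p) / (lambda_1 * eps) * (noise_var + eps) / eps


k, p = 5, 1000
for eps in (0.1, 0.01, 0.001):
    print(eps,
          fast_rate_sample_size(k, p, kappa=0.7, noise_var=1.0, eps=eps),
          optimistic_rate_sample_size(k, p, lambda_1=0.5, noise_var=1.0, eps=eps))
```

Halving $\varepsilon$ doubles the fast-rate sample size but roughly quadruples the optimistic-rate one when $\varepsilon \ll \sigma^2$; with `noise_var=0.0` the fast-rate expression is zero while the optimistic-rate expression still grows like $1/\varepsilon$.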

In this compressed sensing framework, the guarantees on predictive error follow from a stronger guarantee
on the accurate recovery of $\beta^*$, and in particular, the recovery of the true support of $\beta^*$. In order for this to
be possible, it is of course necessary to be able to distinguish between pairs or small sets of covariates. In
particular, some sort of restricted isometry assumption is clearly necessary for bounding error in recovering $\beta^*$
(otherwise, the "best" $\beta^*$ might not be unique). However, if the goal is merely low error in prediction (that is,
we would like accuracy in calculating $x^T \beta^*$, rather than in recovering $\beta^*$), then perhaps this assumption could
be weakened. For example, if a covariate is duplicated in the model, then it will not be possible to distinguish
between the two when attempting to recover the true support; however, adding duplicated covariates to a model
will have no effect on the problem of prediction.
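The duplicated-covariate example can be checked directly. In the minimal sketch below (all names and numbers are hypothetical), three different coefficient vectors place the weight on the first copy, on the second copy, or split it between the two copies; their supports differ, yet their predictions and $\ell_1$-norms coincide.

```python
def predict(x, beta):
    """Linear prediction x^T beta for plain Python lists."""
    return sum(xj * bj for xj, bj in zip(x, beta))


def l1_norm(beta):
    return sum(abs(bj) for bj in beta)


x = [2.0, -1.0]
x_dup = [x[0]] + x  # the first covariate now appears twice: [2.0, 2.0, -1.0]

beta_first = [3.0, 0.0, 0.5]   # all weight on the first copy
beta_second = [0.0, 3.0, 0.5]  # all weight on the second copy
beta_split = [1.5, 1.5, 0.5]   # weight shared between the copies

predictions = [predict(x_dup, b) for b in (beta_first, beta_second, beta_split)]
norms = [l1_norm(b) for b in (beta_first, beta_second, beta_split)]
print(predictions, norms)
```

Since the three vectors are indistinguishable in both prediction and $\ell_1$-norm, no procedure based on these quantities can identify which copy carries the "true" coefficient, even though prediction is unaffected.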

More generally, we are interested in whether the properties that are necessary for the (unique) recovery of
$\beta^*$ are also necessary to obtain strong bounds on excess prediction error, and in the role of the assumptions
that separate the "optimistic rate", unit-scale error bounds of [Srebro et al., 2010] from the "fast rate" error
bounds in the compressed sensing literature, which scale with the magnitude of the noise. Below, we show that,
if we remove either the sparsity assumption (while still assuming that $\beta^*$ has low $\ell_1$-norm) or the restricted
isometry assumption from the compressed sensing framework described above, then up to logarithmic factors,
the "optimistic rate" bound on excess prediction error, given in (3), is the best possible bound. In particular,
this implies that, even in the noiseless setting, we cannot achieve zero error in prediction without stronger
assumptions.



2 Results 



First, we ask whether we can relax the assumption of a sparse true coefficient vector to an assumption on its
$\ell_1$-norm, but still guarantee a fast-rate bound on excess error. Specifically, we consider the question of bounding
excess prediction error, in the well-specified Gaussian setting where the restricted eigenvalue assumption holds,
assuming only an $\ell_1$-norm bound on the true vector of coefficients.



Our first result shows that, up to logarithmic factors, the optimistic-rate error bound (3) is the best possible
rate under these conditions. For simplicity, we will consider the case of completely independent covariates,
$x \sim N(0, I_p)$. In particular, this ensures that the restricted eigenvalue assumption is satisfied. To place the
problem on a unit scale (or rather, to bound the scale away from zero and away from infinity), we consider only
true coefficient vectors $\beta^*$ satisfying

$$\frac{1}{2} \le \mathbb{E}\left[(x^T \beta^*)^2\right]^{1/2} = \|\beta^*\|_2 \le \|\beta^*\|_1 \le 1 .$$



Theorem 1. Fix any $n \ge 30$, $p \ge 3n$, and $\sigma > 0$. Then there exists a $\beta^* \in \mathbb{R}^p$ with $\frac{1}{2} \le \|\beta^*\|_2 \le \|\beta^*\|_1 \le 1$,
such that for any sample, for all $B \ge 0$,

$$\mathbb{E}\left[(y - x^T \hat\beta^B)^2\right] \ge \sigma^2 + \frac{1}{32 n \log^2(3n)} .$$

Additionally, if $100 \le \sqrt{n}/\sigma \le p$, then with probability at least $\frac{1}{2}$ over the sample, for all $B \ge 0$,

$$\mathbb{E}\left[(y - x^T \hat\beta^B)^2\right] \ge \sigma^2 + \frac{\sigma}{102400 \sqrt{n} \log^2\left(\max\{3n, \lceil \sqrt{n}/\sigma \rceil\}\right)} .$$

Here $\hat\beta^B = \arg\min_{\|\beta\|_1 \le B} \sum_i \left(y^{(i)} - x^{(i)T} \beta\right)^2$, where $(x^{(i)}, y^{(i)})$ are i.i.d. samples from the multivariate
Gaussian distribution defined by drawing $x^{(i)} \sim N(0, I_p)$ and $y^{(i)} \sim N\left(x^{(i)T} \beta^*, \sigma^2\right)$. The expectations
are taken over a new sample $(x, y)$ drawn from the same distribution, independently of the training set
$\{(x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})\}$. (For each $B \ge 0$, if $\hat\beta^B$ is not unique, then we show that the inequalities hold
for some choice of $\hat\beta^B$.)
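To get a feel for the size of these lower bounds, the following sketch evaluates the two excess-error terms. It assumes the terms are $1/(32 n\log^2(3n))$ and $\sigma/(102400\sqrt{n}\log^2(\max\{3n, \lceil\sqrt{n}/\sigma\rceil\}))$, as the statement above is read here; the function name and the example values are hypothetical.

```python
import math


def theorem1_lower_bounds(n, p, sigma):
    """Return (deterministic_term, high_probability_term_or_None).

    Both terms are lower bounds on the excess error above sigma^2, using the
    forms assumed in the lead-in; the second applies only when
    100 <= sqrt(n)/sigma <= p.
    """
    if n < 30 or p < 3 * n or sigma <= 0:
        raise ValueError("stated for n >= 30, p >= 3n and sigma > 0")
    deterministic = 1.0 / (32 * n * math.log(3 * n) ** 2)
    ratio = math.sqrt(n) / sigma
    high_probability = None
    if 100 <= ratio <= p:
        p_effective = max(3 * n, math.ceil(ratio))
        high_probability = sigma / (102400 * math.sqrt(n) * math.log(p_effective) ** 2)
    return deterministic, high_probability


print(theorem1_lower_bounds(n=100, p=300, sigma=0.05))
print(theorem1_lower_bounds(n=100, p=300, sigma=1.0))
```

The first term decays like $1/n$ and does not vanish as $\sigma \to 0$, which is the sense in which zero prediction error is not guaranteed even without noise; the second term decays like $\sigma/\sqrt{n}$.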




Next, we ask whether the restricted eigenvalue (or restricted isometry) assumption is necessary for a fast-rate 
bound on excess error, in the well-specified Gaussian setting where the sparsity assumption holds. 

Our second result shows that, up to logarithmic factors, the optimistic-rate error bound (3) is the best possible
rate under these conditions. For simplicity, we restrict our attention to 2-sparse true coefficient vectors. We also
only consider covariance matrices $\Sigma$ such that $\Sigma_{J^*} = I_{J^*}$, where $J^* = \mathrm{Support}(\beta^*)$. That is, ensuring the
restricted isometry property on the true support only is not sufficient for a fast-rate bound on excess error.

To avoid issues of scaling, we restrict our attention to covariance matrices $\Sigma$ with $\|\Sigma\|_{sp} \le 2$, and to true
coefficient vectors $\beta^*$ satisfying

$$\frac{1}{2} \le \mathbb{E}\left[(x^T \beta^*)^2\right]^{1/2} = \sqrt{\beta^{*T} \Sigma \beta^*} = \|\beta^*\|_2 \le \|\beta^*\|_1 \le 1 ,$$

where we make use of the fact that $\Sigma_{J^*} = I_{J^*}$ to obtain the second equality.

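Theorem 2. Fix any $n \ge 30$, $p \ge 3n$, and $\sigma > 0$. Then there exists a 2-sparse $\beta^* \in \mathbb{R}^p$ with $\frac{1}{2} \le \|\beta^*\|_2 \le \|\beta^*\|_1 \le 1$,
and a positive semi-definite $\Sigma \in \mathbb{R}^{p \times p}$ with $\|\Sigma\|_{sp} \le 2$ and $\Sigma_{\mathrm{Support}(\beta^*)} = I_{\mathrm{Support}(\beta^*)}$, such that
for any sample, for all $B \ge 0$,

$$\mathbb{E}\left[(y - x^T \hat\beta^B)^2\right] \ge \sigma^2 + \frac{1}{288 n \log^2(3n)} .$$

Additionally, if $100 \le \sqrt{n}/\sigma \le p - 3$, then with probability at least $\frac{1}{2}$ over the sample, for all $B \ge 0$,

$$\mathbb{E}\left[(y - x^T \hat\beta^B)^2\right] \ge \sigma^2 + \frac{\sigma}{409600 \sqrt{n} \log^2\left(\max\{3n, \lceil \sqrt{n}/\sigma \rceil\}\right)} .$$

Here $\hat\beta^B = \arg\min_{\|\beta\|_1 \le B} \sum_i \left(y^{(i)} - x^{(i)T} \beta\right)^2$, where $(x^{(i)}, y^{(i)})$ are i.i.d. samples from the multivariate
Gaussian distribution defined by drawing $x^{(i)} \sim N(0, \Sigma)$ and $y^{(i)} \sim N\left(x^{(i)T} \beta^*, \sigma^2\right)$. The expectations
are taken over a new sample $(x, y)$ drawn from the same distribution, independently of the training set
$\{(x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})\}$. (For each $B \ge 0$, if $\hat\beta^B$ is not unique, then we show that the inequalities hold
for some choice of $\hat\beta^B$.)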



In particular, Theorem 2 shows that without placing any assumptions on the covariates outside of $\mathrm{Support}(\beta^*)$,
we cannot guarantee a bound on excess error that is better than the optimistic rate obtained by Srebro et al.
[2010] from concentration bounds, up to logarithmic factors.



3 Proofs



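We begin by defining a class of predictors that are optimal with respect to the squared-error loss and the $\ell_1$-norm
regularizer:

Definition 1. Given $y^{(i)} \in \mathbb{R}$ and $x^{(i)} \in \mathbb{R}^p$ for $i = 1, \dots, n$, a predictor $\hat\beta \in \mathbb{R}^p$ is Pareto-optimal (with
respect to empirical squared-error and $\ell_1$-norm) if, for every $\beta \in \mathbb{R}^p$, it satisfies

$$\sum_i \left(y^{(i)} - x^{(i)T} \beta\right)^2 < \sum_i \left(y^{(i)} - x^{(i)T} \hat\beta\right)^2 \;\Rightarrow\; \|\beta\|_1 > \|\hat\beta\|_1 ,$$

that is, if we cannot improve its empirical squared error without increasing its $\ell_1$-norm, and vice versa.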
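The definition can be probed on a finite set of candidates. The sketch below is only a necessary-condition check, since true Pareto-optimality quantifies over all of $\mathbb{R}^p$; it reports whether some candidate is no worse in both empirical squared error and $\ell_1$-norm and strictly better in at least one. The helper names and the toy data are hypothetical.

```python
def squared_error(X, y, beta):
    """Empirical squared error sum_i (y_i - x_i^T beta)^2 on list-based data."""
    return sum((yi - sum(xij * bj for xij, bj in zip(xi, beta))) ** 2
               for xi, yi in zip(X, y))


def l1_norm(beta):
    return sum(abs(bj) for bj in beta)


def is_dominated(X, y, beta, candidates):
    """True if some candidate is no worse in error and norm, and strictly better in one.

    A finite check only: returning False does not prove Pareto-optimality.
    """
    err, norm = squared_error(X, y, beta), l1_norm(beta)
    for other in candidates:
        other_err, other_norm = squared_error(X, y, other), l1_norm(other)
        no_worse = other_err <= err and other_norm <= norm
        strictly_better = other_err < err or other_norm < norm
        if no_worse and strictly_better:
            return True
    return False


# One observation with two identical covariates, so many vectors fit exactly.
X = [[1.0, 1.0]]
y = [1.0]
print(is_dominated(X, y, [2.0, -1.0], [[1.0, 0.0]]))               # wasteful fit
print(is_dominated(X, y, [1.0, 0.0], [[2.0, -1.0], [0.5, 0.5]]))   # economical fit
```

The vector `[2.0, -1.0]` fits the data exactly but with $\ell_1$-norm 3, so it is dominated by `[1.0, 0.0]`, which achieves the same fit with norm 1.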

The following lemma states a well-known property of $\ell_1$-regularized regression; we include a proof for
completeness.

Lemma 1. For any $y^{(1)}, \dots, y^{(n)} \in \mathbb{R}$ and $x^{(1)}, \dots, x^{(n)} \in \mathbb{R}^p$, for any $B \ge 0$, the class

$$\hat{\mathcal{B}}^B = \arg\min_{\|\beta\|_1 \le B} \sum_i \left(y^{(i)} - x^{(i)T} \beta\right)^2$$

must contain a predictor $\hat\beta^B$ that is Pareto-optimal and satisfies $\|\hat\beta^B\|_0 \le n$.

Proof. Let $\mathrm{Err}_B = \inf_{\|\beta\|_1 \le B} \sum_i \left(y^{(i)} - x^{(i)T} \beta\right)^2$. Since $\{\|\beta\|_1 \le B\}$ is a compact set, this infimum is
attained by some $\beta$ with $\|\beta\|_1 \le B$. Now define

$$B' = \inf\left\{ \|\beta\|_1 : \sum_i \left(y^{(i)} - x^{(i)T} \beta\right)^2 \le \mathrm{Err}_B \right\} \le B .$$

Again, by compactness, this infimum is attained by some $\tilde\beta$. We then see that $\tilde\beta$ is Pareto-optimal by its
construction. Finally, by Theorem 3 of [Rosset et al., 2004], there exists a $\hat\beta^B \in \mathbb{R}^p$ such that $\|\hat\beta^B\|_0 \le n$,
$\|\hat\beta^B\|_1 \le \|\tilde\beta\|_1$, and $X \hat\beta^B = X \tilde\beta$. This is sufficient. □

Next we state two additional lemmas, proved in the next section. 

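Lemma 2. Fix $n$ and $p$ with $n \ge 30$ and $p \ge 3n$. Let $x^{(i)} \overset{iid}{\sim} N(0, \Sigma)$ for some $\Sigma \in \mathbb{R}^{p \times p}$, and let $\beta^* \in \mathbb{R}^p$
be fixed. Then with probability at least $1 - 2e^{-n \log(p)}$, for all $J \subset [p]$ with $|J| = n$,

$$\left\| X_J^T X (\beta - \beta^*) \right\|_2 \le \|\Sigma\|_{sp} \cdot 16\sqrt{2} \cdot n \log(p) \cdot \sqrt{(\beta - \beta^*)^T \Sigma (\beta - \beta^*)} \quad \text{for all } \beta \in \mathbb{R}^p \text{ with } \beta_{\bar J} = 0, \qquad (8)$$

where the matrix $X$ has entries $X_{ij} = x^{(i)}_j$, and $X_J$ consists of the columns of $X$ indexed by $j \in J$.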

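Lemma 3. Let $x^{(i)} \overset{iid}{\sim} N(0, \Sigma)$ for some $\Sigma \in \mathbb{R}^{p \times p}$, and let $z \in \mathbb{R}^n$ be fixed, with $\|z\|_2^2 \ge 0.5n$. Assume
$\sqrt{n}/\sigma \ge 100$. Then with probability at least $1 - e^{-0.015 \sigma^{-1} \sqrt{n}}$, for all $J_1 \subset [\lceil \sqrt{n}/\sigma \rceil]$ with $|J_1| \ge \frac{\sqrt{n}}{2\sigma}$,

$$\left\| \mathrm{Proj}_{\mathbf{1}^\perp}\left(X_{J_1}^T z\right) \right\|_2^2 \ge \frac{\lambda_{\min}\left(\Sigma_{[\lceil \sqrt{n}/\sigma \rceil]}\right) n^{3/2}}{200 \sigma} , \qquad (9)$$

where the matrix $X$ has entries $X_{ij} = x^{(i)}_j$, and $X_{J_1}$ consists of the columns of $X$ indexed by $j \in J_1$.

We now prove the theorems.

3.1 Proof of Theorem 1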



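Theorem 1. Fix any $n \ge 30$, $p \ge 3n$, and $\sigma > 0$. Then there exists a $\beta^* \in \mathbb{R}^p$ with $\frac{1}{2} \le \|\beta^*\|_2 \le \|\beta^*\|_1 \le 1$,
such that for any sample, for all $B \ge 0$,

$$\mathbb{E}\left[(y - x^T \hat\beta^B)^2\right] \ge \sigma^2 + \frac{1}{32 n \log^2(3n)} . \qquad (10)$$

Additionally, if $100 \le \sqrt{n}/\sigma \le p$, then with probability at least $\frac{1}{2}$ over the sample, for all $B \ge 0$,

$$\mathbb{E}\left[(y - x^T \hat\beta^B)^2\right] \ge \sigma^2 + \frac{\sigma}{102400 \sqrt{n} \log^2\left(\max\{3n, \lceil \sqrt{n}/\sigma \rceil\}\right)} . \qquad (11)$$

Here $\hat\beta^B = \arg\min_{\|\beta\|_1 \le B} \sum_i \left(y^{(i)} - x^{(i)T} \beta\right)^2$, where $(x^{(i)}, y^{(i)})$ are i.i.d. samples from the multivariate
Gaussian distribution defined by drawing $x^{(i)} \sim N(0, I_p)$ and $y^{(i)} \sim N\left(x^{(i)T} \beta^*, \sigma^2\right)$. The expectations
are taken over a new sample $(x, y)$ drawn from the same distribution, independently of the training set
$\{(x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})\}$. (For each $B \ge 0$, if $\hat\beta^B$ is not unique, then we show that the inequalities hold
for some choice of $\hat\beta^B$.)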



Proof. Let /3* be 



(3* = 



1 



j ■ 41ogp' 



1,. 



>p - 1; /3» = 



Note that $\|\beta^*\|_1 \le 1$ and $\|\beta^*\|_2 \ge \frac{1}{2}$, and so the resulting distribution satisfies the desired assumptions.

By Lemma 1, for any $B \ge 0$, the set $\arg\min_{\|\beta\|_1 \le B} \sum_i \left(y^{(i)} - x^{(i)T} \beta\right)^2$ must include a Pareto-optimal vector
$\hat\beta^B$ with $\|\hat\beta^B\|_0 \le n$. Therefore, it is sufficient to show that bounds (10) and (11) hold for all Pareto-optimal
vectors $\hat\beta$ with $\|\hat\beta\|_0 \le n$. We now prove these two bounds separately.



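Proof of (10). For any $\hat\beta$ with $\|\hat\beta\|_0 \le n$, we have

$$\left\|\hat\beta - \beta^*\right\|_2^2 \ge \sum_{j \le p-1 :\, \hat\beta_j = 0} \left(\frac{1}{j \cdot 4 \log p}\right)^2 \ge \sum_{j=n+1}^{p-1} \left(\frac{1}{j \cdot 4 \log p}\right)^2 \ge \frac{1}{16 \log^2(p)} \int_{x=n+1}^{p} \frac{1}{x^2}\, dx = \frac{1}{16 \log^2(p)} \left(\frac{1}{n+1} - \frac{1}{p}\right) \ge \frac{1}{32 n \log^2(p)} .$$

This proves the claim when $p = 3n$. However, the claim is immediately true for any larger value of $p$, since
we may add in an arbitrary number of zero covariates (and assign zero coefficients to these covariates), without
affecting the results.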



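Proof of (11). By Lemma 1 of [Laurent and Massart, 2000], with probability at least $1 - e^{-0.0625 n} \ge 0.75$,
$\|z\|_2^2 \sim \chi^2_n \ge 0.5n$. For the remainder of the proof, we treat $z \in \mathbb{R}^n$ as a fixed vector, and assume $\|z\|_2^2 \ge 0.5n$.

Assume that (8) holds for all $J \subset [p]$ with $|J| = n$, and (9) holds for all $J_1 \subset [\lceil \sqrt{n}/\sigma \rceil]$ with $|J_1| \ge \frac{\sqrt{n}}{2\sigma}$. (By
Lemmas 2 and 3, this is true with probability at least $1 - 2e^{-n \log(p)} - e^{-0.015 \sigma^{-1} \sqrt{n}} \ge 0.75$.) Now choose any
Pareto-optimal $\hat\beta$ with $\|\hat\beta\|_0 \le n$.

Suppose that $\|\hat\beta - \beta^*\|_2^2 < \frac{\sigma}{102400 \sqrt{n} \log^2(p)}$. First, we show that

$$\left| \left\{ j \in [\lceil \sqrt{n}/\sigma \rceil] : \hat\beta_j > 0 \right\} \right| \ge \frac{\sqrt{n}}{2\sigma} .$$

Suppose not. Then

$$\left\|\hat\beta - \beta^*\right\|_2^2 \ge \sum_{j \in [\lceil \sqrt{n}/\sigma \rceil] :\, \hat\beta_j \le 0} \left(\hat\beta_j - \frac{1}{j \cdot 4 \log(p)}\right)^2 \ge \frac{1}{16 \log^2(p)} \sum_{j = \lceil \sqrt{n}/2\sigma \rceil}^{\lceil \sqrt{n}/\sigma \rceil} \frac{1}{j^2} \ge \frac{1}{16 \log^2(p)} \left( \frac{1}{\lceil \sqrt{n}/2\sigma \rceil} - \frac{1}{2 \lceil \sqrt{n}/2\sigma \rceil} \right) = \frac{1}{16 \log^2(p) \cdot 2 \lceil \sqrt{n}/2\sigma \rceil} \ge \frac{\sigma}{32 \sqrt{n} \log^2(p)} .$$

This is a contradiction.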

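Now define $J_1 = \left\{ j \in [\lceil \sqrt{n}/\sigma \rceil] : \hat\beta_j > 0 \right\}$, and fix any $J \supseteq \mathrm{Support}(\hat\beta)$ with $|J| = n$. Since $\hat\beta$ is
Pareto-optimal with positive entries $\hat\beta_j$ for all $j \in J_1$, we have

$$\frac{\partial}{\partial \beta_{J_1}} \|\beta\|_1 \Big|_{\beta = \hat\beta} = \mathbf{1}_{J_1} .$$

Therefore, by the theory of Lagrange multipliers, we must have $X_{J_1}^T y - X_{J_1}^T X \hat\beta = C \cdot \mathbf{1}_{J_1}$ for some $C \in \mathbb{R}$.
We then have

$$X_{J_1}^T X (\hat\beta - \beta^*) = \sigma \cdot X_{J_1}^T z - C \cdot \mathbf{1}_{J_1} . \qquad (12)$$

By (8), the norm of the left-hand side of (12) can be bounded from above as

$$\left\| X_{J_1}^T X (\hat\beta - \beta^*) \right\|_2 \le \left\| X_J^T X (\hat\beta - \beta^*) \right\|_2 \le \|\Sigma\|_{sp} \cdot 16\sqrt{2} \cdot n \log(p) \cdot \sqrt{(\hat\beta - \beta^*)^T \Sigma (\hat\beta - \beta^*)} .$$

By (9), the norm of the right-hand side of (12) can be bounded from below as

$$\left\| \sigma \cdot X_{J_1}^T z - C \cdot \mathbf{1}_{J_1} \right\|_2 \ge \sigma \left\| \mathrm{Proj}_{\mathbf{1}^\perp}\left(X_{J_1}^T z\right) \right\|_2 \ge \sigma \sqrt{\frac{\lambda_{\min}\left(\Sigma_{[\lceil \sqrt{n}/\sigma \rceil]}\right) n^{3/2}}{200\sigma}} .$$

Therefore, returning to (12), we have

$$\|\Sigma\|_{sp} \cdot 16\sqrt{2} \cdot n \log(p) \cdot \sqrt{(\hat\beta - \beta^*)^T \Sigma (\hat\beta - \beta^*)} \ge \left\| X_{J_1}^T X (\hat\beta - \beta^*) \right\|_2 = \left\| \sigma \cdot X_{J_1}^T z - C \cdot \mathbf{1}_{J_1} \right\|_2 \ge \sigma \sqrt{\frac{\lambda_{\min}\left(\Sigma_{[\lceil \sqrt{n}/\sigma \rceil]}\right) n^{3/2}}{200\sigma}} .$$

Therefore,

$$(\hat\beta - \beta^*)^T \Sigma (\hat\beta - \beta^*) \ge \frac{\sigma \cdot \lambda_{\min}\left(\Sigma_{[\lceil \sqrt{n}/\sigma \rceil]}\right)}{102400 \|\Sigma\|_{sp}^2 \cdot \sqrt{n} \log^2(p)} = \frac{\sigma}{102400 \sqrt{n} \log^2(p)} .$$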



This proves the claim when $p = \max\{3n, \lceil \sqrt{n}/\sigma \rceil\}$. As in the proof of (10), this is sufficient to prove the claim
for any larger value of $p$.



□

3.2 Proof of Theorem 2



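Theorem 2. Fix any $n \ge 30$, $p \ge 3n$, and $\sigma > 0$. Then there exists a 2-sparse $\beta^* \in \mathbb{R}^p$ with $\frac{1}{2} \le \|\beta^*\|_2 \le \|\beta^*\|_1 \le 1$,
and a positive semi-definite $\Sigma \in \mathbb{R}^{p \times p}$ with $\|\Sigma\|_{sp} \le 2$ and $\Sigma_{\mathrm{Support}(\beta^*)} = I_{\mathrm{Support}(\beta^*)}$, such that
for any sample, for all $B \ge 0$,

$$\mathbb{E}\left[(y - x^T \hat\beta^B)^2\right] \ge \sigma^2 + \frac{1}{288 n \log^2(3n)} . \qquad (13)$$

Additionally, if $100 \le \sqrt{n}/\sigma \le p - 3$, then with probability at least $\frac{1}{2}$ over the sample, for all $B \ge 0$,

$$\mathbb{E}\left[(y - x^T \hat\beta^B)^2\right] \ge \sigma^2 + \frac{\sigma}{409600 \sqrt{n} \log^2\left(\max\{3n, \lceil \sqrt{n}/\sigma \rceil\}\right)} . \qquad (14)$$

Here $\hat\beta^B = \arg\min_{\|\beta\|_1 \le B} \sum_i \left(y^{(i)} - x^{(i)T} \beta\right)^2$, where $(x^{(i)}, y^{(i)})$ are i.i.d. samples from the multivariate
Gaussian distribution defined by drawing $x^{(i)} \sim N(0, \Sigma)$ and $y^{(i)} \sim N\left(x^{(i)T} \beta^*, \sigma^2\right)$. The expectations
are taken over a new sample $(x, y)$ drawn from the same distribution, independently of the training set
$\{(x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})\}$. (For each $B \ge 0$, if $\hat\beta^B$ is not unique, then we show that the inequalities hold
for some choice of $\hat\beta^B$.)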



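Proof. Let $w_1, w_2, u_1, \dots, u_{p-3} \overset{iid}{\sim} N(0, 1)$. Define

$$\tau = \frac{1}{4 \log p} \cdot \left(1, \frac{1}{2}, \dots, \frac{1}{p-3}\right) \in \mathbb{R}^{p-3} .$$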

Since p > 90, ||r||i < | and ||t||| < 9 lo ^ ^ < 0.01. Now we define an additional covariate as a linear 
combination of the others: 




Now define $x = (u_1, \dots, u_{p-3}, v, w_1, w_2)$. Let $\Sigma = \mathrm{Cov}(x)$, and note that $\sigma_{\max} = \|\Sigma\|_2 \le 2$.
Define 

Sparse = (o p _a,0, -, -) , /3^„ se = ^-^_== . r, -^__, ,0^ 

and 

V (l) =\{Wl+ W2) = X {l)T fc P arse = ^ ' Pdense ■ 



Note that /3 s * pQrse and (3* dense are both optimal predictors. Since ||/? s * parse || i = 1, l3* parse T ^/3* parse = ||/3* parse ||i 
i, and f3* parse is 2-sparse, this distribution satisfies the desired assumptions. However, 

ll ^- lll = V2(i-m (1 + IMIl) "^ <1 ' 

and so in a sense $\beta^*_{dense}$ will be preferred to $\beta^*_{sparse}$ in $\ell_1$-regularized regression, thus leading to the same
arguments as in the proof of Theorem 1.

By Lemma 1, for any $B \ge 0$, the set $\arg\min_{\|\beta\|_1 \le B} \sum_i \left(y^{(i)} - x^{(i)T} \beta\right)^2$ must include a Pareto-optimal vector
$\hat\beta^B$ with $\|\hat\beta^B\|_0 \le n$. Therefore, it is sufficient to show that bounds (13) and (14) hold for all Pareto-optimal
vectors $\hat\beta$ with $\|\hat\beta\|_0 \le n$. For each such $\hat\beta$, we use the notation



\0u J 0W\ t 0W2 1 0v ) £ 



o-3 



X K X K X 



Observe that, by definition of the covariates, 

(0-0 sparse Yn0-0 sparse J 



(15) 



0u - T0 t 



(Term 1) 



(Term 2) 



(Term 3) 



The remainder of the proof is organized as follows. First, we prove bounds (13) and (14) for any Pareto-optimal
$\hat\beta$ with $\hat\beta_v < \frac{1}{3}$. Next, we prove the bound (13) for any Pareto-optimal $\hat\beta$ with $\|\hat\beta\|_0 \le n$ and $\hat\beta_v \ge \frac{1}{3}$. Finally,
we prove the bound (14) for any Pareto-optimal $\hat\beta$ with $\|\hat\beta\|_0 \le n$ and $\hat\beta_v \ge \frac{1}{3}$.



Proof of ( Il3l and dT4t when /3„ < |. Consider any Pareto-optimal p with p\, < |. First, suppose that 

By the definition of the covariates, x^ T — x^ T p for all i. We will now show that 1 < ||/3||i. We have 

||/3||i = ||&||i + 1/9^1 + |A 0a |+|&| 



/? = A, + 



0u 



0u 



2VT 1 



0wi 



1 

V2 



2V2 



A, 



20HMI1 



2V1HMI 



< 



20^ 



Mi 



0W1 



1 

V2 



= Plll-^ 



27iqMjf 20: 



<\\0h-^ 



V2 2\/l-0.01 2 2V 



Therefore, this case leads to a contradiction, since we have constructed a coefficient vector with zero error on 
the training set, and lower £i-norm than 0, Therefore, we must have either Wl < or W2 < Without 

loss of generality, we assume Wl < 



Then 



111 1 

< -- 



W1 + ^1 M\10 V ^<^> V z 3 



and so by (Term 2) in ( fT3T > above, 



(0 ~ 0*s P arse) T n0 ~ 0* spa rse) > [ 0^ + ^= V 1 ~ IMI2& ~ ^ ) ~ 72 



This is sufficient to show that both (\3[ and ( fl4b are satisfied. 
Proof of (13) when $\hat\beta_v \ge \frac{1}{3}$. Consider any Pareto-optimal $\hat\beta$ with $\|\hat\beta\|_0 \le n$ and $\hat\beta_v \ge \frac{1}{3}$. We then have



> 



g. vuiuium unj ion,iu upumiu p vvii.ii ||p|||J _i " "™ f II ^ 3- 

2 f-f V J 41og(p) j 



3=1 

p-3 



a > E 



1 1 



41og(p) j 



7 -A, 



p-3 



1 



cZx 



1 



161og>) ^ j 2 " 1441og>) J x =„ x 2 1441og» \n p - 3 J " 288n log>) 



1 1 



> 



But, considering (Term 1) in (15) above, this proves that

$$(\hat\beta - \beta^*_{sparse})^T \Sigma (\hat\beta - \beta^*_{sparse}) \ge \frac{1}{288 n \log^2(p)} .$$

This proves the claim when $p = 3n$. As in the proof of Theorem 1, this is sufficient to prove the claim for any
larger value of $p$.



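Proof of (14) when $\hat\beta_v \ge \frac{1}{3}$. By Lemma 1 of Laurent and Massart [2000], with probability at least
$1 - e^{-0.0625 n} \ge 0.75$, $\|z\|_2^2 \sim \chi^2_n \ge 0.5n$. For the remainder of the proof, we treat $z \in \mathbb{R}^n$ as a fixed vector, and
assume $\|z\|_2^2 \ge 0.5n$.

Assume that (8) holds for all $J \subset [p]$ with $|J| = n$, and (9) holds for all $J_1 \subset [\lceil \sqrt{n}/\sigma \rceil]$ with $|J_1| \ge \frac{\sqrt{n}}{2\sigma}$. (By
Lemmas 2 and 3, this is true with probability at least $1 - 2e^{-n \log(p)} - e^{-0.015 \sigma^{-1} \sqrt{n}} \ge 0.75$.) Consider any
Pareto-optimal $\hat\beta$ with $\|\hat\beta\|_0 \le n$ and $\hat\beta_v \ge \frac{1}{3}$. First, suppose that

$$\left| \left\{ j \in [\lceil \sqrt{n}/\sigma \rceil] : \hat\beta_j > 0 \right\} \right| < \frac{\sqrt{n}}{2\sigma} .$$

Then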



p-3 , 

3=1 ^ 

- gn! V 
" 16 log 2 (p) 

je[rv^Al]:^<o 



41og(p) j 



E 



1 i-A 



4 log(p) j 



1 > i .. v 1 

i3 — T3JT7^7?77a / j ~p 



J 1 — 144 log 2 (p) / y 



> 



144 log 2 (p) 



r(ix 



144 log 2 (p) I [v^/2<t] 2pS727j J 144 log 2 (p)-2rv^"/2 CT ] — 288^ log 2 (p) ' 



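Considering (Term 1) in (15), this proves that

$$(\hat\beta - \beta^*_{sparse})^T \Sigma (\hat\beta - \beta^*_{sparse}) \ge \left\| \hat\beta_u - \tau \hat\beta_v \right\|_2^2 \ge \frac{\sigma}{288 \sqrt{n} \log^2(p)} .$$

Next, suppose instead that

$$\left| \left\{ j \in [\lceil \sqrt{n}/\sigma \rceil] : \hat\beta_j > 0 \right\} \right| \ge \frac{\sqrt{n}}{2\sigma} .$$

Define $J_1 = \left\{ j \in [\lceil \sqrt{n}/\sigma \rceil] : \hat\beta_j > 0 \right\}$, and fix any $J \supseteq \mathrm{Support}(\hat\beta)$ with $|J| = n$. Since $\hat\beta$ is Pareto-optimal
with positive entries $\hat\beta_j$ for all $j \in J_1$, we have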
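$$\frac{\partial}{\partial \beta_{J_1}} \|\beta\|_1 \Big|_{\beta = \hat\beta} = \mathbf{1}_{J_1} .$$

Therefore, by the theory of Lagrange multipliers, we must have $X_{J_1}^T y - X_{J_1}^T X \hat\beta = C \cdot \mathbf{1}_{J_1}$ for some $C \in \mathbb{R}$.
We then have

$$X_{J_1}^T X (\hat\beta - \beta^*) = \sigma \cdot X_{J_1}^T z - C \cdot \mathbf{1}_{J_1} . \qquad (16)$$

By (8), the norm of the left-hand side of (16) can be bounded from above as

$$\left\| X_{J_1}^T X (\hat\beta - \beta^*) \right\|_2 \le \left\| X_J^T X (\hat\beta - \beta^*) \right\|_2 \le \|\Sigma\|_{sp} \cdot 16\sqrt{2} \cdot n \log(p) \cdot \sqrt{(\hat\beta - \beta^*)^T \Sigma (\hat\beta - \beta^*)} .$$

By (9), the norm of the right-hand side of (16) can be bounded from below as

$$\left\| \sigma \cdot X_{J_1}^T z - C \cdot \mathbf{1}_{J_1} \right\|_2 \ge \sigma \left\| \mathrm{Proj}_{\mathbf{1}^\perp}\left(X_{J_1}^T z\right) \right\|_2 \ge \sigma \sqrt{\frac{\lambda_{\min}\left(\Sigma_{[\lceil \sqrt{n}/\sigma \rceil]}\right) n^{3/2}}{200\sigma}} .$$

Therefore, returning to (16), we have

$$\|\Sigma\|_{sp} \cdot 16\sqrt{2} \cdot n \log(p) \cdot \sqrt{(\hat\beta - \beta^*)^T \Sigma (\hat\beta - \beta^*)} \ge \left\| X_{J_1}^T X (\hat\beta - \beta^*) \right\|_2 = \left\| \sigma \cdot X_{J_1}^T z - C \cdot \mathbf{1}_{J_1} \right\|_2 \ge \sigma \sqrt{\frac{\lambda_{\min}\left(\Sigma_{[\lceil \sqrt{n}/\sigma \rceil]}\right) n^{3/2}}{200\sigma}} .$$

Therefore,

$$(\hat\beta - \beta^*)^T \Sigma (\hat\beta - \beta^*) \ge \frac{\sigma \cdot \lambda_{\min}\left(\Sigma_{[\lceil \sqrt{n}/\sigma \rceil]}\right)}{102400 \|\Sigma\|_{sp}^2 \cdot \sqrt{n} \log^2(p)} \ge \frac{\sigma}{409600 \sqrt{n} \log^2(p)} .$$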



This proves the claim when p = max {3n, }. As in the proof of TheoremQ] this is sufficient to prove the 

claim for any larger value of p. 

□ 



4 Proofs for Lemmas 



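4.1 Proof of Lemma 2



Fix any $J \subset [p]$ with $|J| = n$. We will show that, with probability at least $1 - 2e^{-2n \log(p)}$,

$$\left\| X_J^T X (\beta - \beta^*) \right\|_2 \le \|\Sigma\|_{sp} \cdot 16\sqrt{2} \cdot n \log(p) \cdot \sqrt{(\beta - \beta^*)^T \Sigma (\beta - \beta^*)} \quad \text{for all } \beta \in \mathbb{R}^p \text{ with } \beta_{\bar J} = 0 .$$

Since there are $\binom{p}{n} \le p^n$ choices for the set $J$, this will be sufficient to prove the lemma.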
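Reorder the covariates to write $\Sigma = \begin{pmatrix} \Sigma_{JJ} & \Sigma_{J \bar J} \\ \Sigma_{\bar J J} & \Sigma_{\bar J \bar J} \end{pmatrix}$. Choose a Cholesky decomposition

$$\Sigma = \begin{pmatrix} U & V \\ 0 & W \end{pmatrix}^T \begin{pmatrix} U & V \\ 0 & W \end{pmatrix} .$$

Let $a^{(i)} \overset{iid}{\sim} N(0, I_p)$. Then $\begin{pmatrix} U & V \\ 0 & W \end{pmatrix}^T a^{(i)} \sim N(0, \Sigma)$, and so

$$\begin{pmatrix} X_J & X_{\bar J} \end{pmatrix} = \begin{pmatrix} A_J U & A_J V + A_{\bar J} W \end{pmatrix} ,$$

where the matrix $A$ has entries $A_{ij} = a^{(i)}_j$, and $A_J$ consists of the columns of $A$ indexed by $j \in J$. We then
have, using $\beta_{\bar J} = 0$,

$$X_J^T X (\beta - \beta^*) = U^T A_J^T \left( A_J U (\beta - \beta^*)_J + (A_J V + A_{\bar J} W)(\beta - \beta^*)_{\bar J} \right)$$

$$= U^T A_J^T \left( A_J U (\beta - \beta^*)_J - (A_J V + A_{\bar J} W) \beta^*_{\bar J} \right) = U^T A_J^T \left( A_J \left( U (\beta - \beta^*)_J - V \beta^*_{\bar J} \right) - A_{\bar J} W \beta^*_{\bar J} \right) .$$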



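Below, we will show that

$$\|A_J\|_{sp} \le \sqrt{16 n \log(p)} \quad \text{with probability at least } 1 - e^{-2n \log(p)} , \qquad (17)$$

$$\text{and } \left\| A_{\bar J} W \beta^*_{\bar J} \right\|_2 \le \sqrt{16 n \log(p)} \cdot \left\| W \beta^*_{\bar J} \right\|_2 \quad \text{with probability at least } 1 - e^{-2n \log(p)} . \qquad (18)$$

Assume that these bounds hold. Then for any $\beta \in \mathbb{R}^p$ with $\beta_{\bar J} = 0$, we have

$$\left\| U^T A_J^T \left( A_J \left( U (\beta - \beta^*)_J - V \beta^*_{\bar J} \right) - A_{\bar J} W \beta^*_{\bar J} \right) \right\|_2$$

$$\le \|U\|_{sp} \cdot \|A_J\|_{sp} \cdot \left( \|A_J\|_{sp} \cdot \left\| U (\beta - \beta^*)_J - V \beta^*_{\bar J} \right\|_2 + \left\| A_{\bar J} W \beta^*_{\bar J} \right\|_2 \right)$$

$$\le \|\Sigma\|_{sp} \cdot \sqrt{16 n \log(p)} \cdot \left( \sqrt{16 n \log(p)} \cdot \left\| U (\beta - \beta^*)_J - V \beta^*_{\bar J} \right\|_2 + \sqrt{16 n \log(p)} \cdot \left\| W \beta^*_{\bar J} \right\|_2 \right)$$

$$= \|\Sigma\|_{sp} \cdot 16 n \log(p) \cdot \left( \left\| U (\beta - \beta^*)_J - V \beta^*_{\bar J} \right\|_2 + \left\| W \beta^*_{\bar J} \right\|_2 \right)$$

$$\le \|\Sigma\|_{sp} \cdot 16 n \log(p) \cdot \sqrt{2} \cdot \left( \left\| U (\beta - \beta^*)_J - V \beta^*_{\bar J} \right\|_2^2 + \left\| W \beta^*_{\bar J} \right\|_2^2 \right)^{1/2}$$

$$= \|\Sigma\|_{sp} \cdot 16\sqrt{2} \cdot n \log(p) \cdot \sqrt{(\beta - \beta^*)^T \Sigma (\beta - \beta^*)} .$$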



We conclude by proving (17) and (18). We first prove (17) using a construction from Keshavan et al. [2010].
First, defined = lit G (57^) = Hh < l}- By Remark 5.1 in lKeshavan et al.1 1|2010|| . 

\\Aj\\ sp < V2 sup \u T Ajv\ . 



u.vEU 



For any v E 



therefore, 



Pr (|u t Ajd| > ^/8nlog(p)) < Fr (|JV(0, 1)| > y/8nlog(p)) < e~ 4 " log ^ . 
Furthermore, \U\ < (2 [8^1 + 1)" < P n ■ So, 

$$\Pr\left( \|A_J\|_{sp} \le \sqrt{16 n \log(p)} \right) \ge \Pr\left( |u^T A_J v| \le \sqrt{8 n \log(p)} \text{ for all } u, v \in U \right) \ge 1 - p^{2n} e^{-4n \log(p)} \ge 1 - e^{-2n \log(p)} .$$



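Next we prove (18). We have

$$\left\| A_{\bar J} W \beta^*_{\bar J} \right\|_2^2 \sim \left\| W \beta^*_{\bar J} \right\|_2^2 \cdot \chi^2_n .$$

By Lemma 1 of [Laurent and Massart, 2000], $\Pr\left( \chi^2_n \ge 16 n \log(p) \right) \le e^{-2n \log(p)}$. This is sufficient.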



4.2 Proof of Lemma 3



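Choose any $J_2 \subset J_1$ with $|J_2| = \lceil \sqrt{n}/2\sigma \rceil$. Observe that $\left\| \mathrm{Proj}_{\mathbf{1}^\perp}\left(X_{J_1}^T z\right) \right\|_2 \ge \left\| \mathrm{Proj}_{\mathbf{1}^\perp}\left(X_{J_2}^T z\right) \right\|_2$, and so it is
sufficient to only consider the sets $J_2$ of size $\lceil \sqrt{n}/2\sigma \rceil$.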



Fix any J 2 C [|V"/o-~|] with |J 2 | = [V"/2o-]. Let P e M^/ 2 "! x l^fa] be the orthogonal projection ma- 
trix corresponding to Proj^J-). Write P£j 2 P = AA T for A € R^/ 2 -! x^/H-i). Then (Xj 2 z) ~ 

AT(0, pill- S, /2 ) and so Proj^ (Xj 2 z) - iV(0, ||z|| 2 • PS,; 2 P), and therefore Proj^ (Xj 2 z) = \\z\\%- Au for 

u ^ N (0,lrvi/ij_i). By examining the definition of A, we see that it T (j4 A)u > ||u|| 2 ■ A^ in (X^y^/^]), 
therefore, 



Furthe rmore, the number of such sets J 2 is bounded by 2 r v/5 7 < *l . By the chi-square tail bounds from lFoygel and Drton 
using the assumption that > 100, we have 



Prix 



< 



-rv^Ao-i-i - 100(T/ 

< exp {i ( - 2) (1 - 0.02 • § + log (0.02 •§))}< cxp { \ {\^2a\ ■ § ) (l - 0.02 • § + log (0.02 ■ §)) } 



Therefore, 



Pr (3J! C [[V^]] ,|Ji| > & UProj^^z)!^ < nAL (X^j 



200ct 



<Pr 3J 2 C [|7q],|J 2 



200(7 



< oT^Al . P r f v 2 < < 2^/"! • p -°-7084[^A1 < -O.Q15rv^Al < -0.015^- ^ 

- Z ^ r ^Xfv^/a<rl-l - 100a) - L e - - 